Using eigenvectors of the bigram graph to infer morpheme identity
Abstract
This paper describes the results of some experiments exploring statistical methods to infer syntactic categories from a raw corpus in an unsupervised fashion. It shares some ground with Brown et al. (1992) and the work that has grown out of it: it employs statistical techniques to derive categories based on which words occur adjacent to a given word. However, we use an eigenvector decomposition of a nearest-neighbor graph to produce a two-dimensional rendering of the words of a corpus in which words of the same syntactic category tend to form clusters and neighborhoods. We exploit this technique to extend the value of automatic learning of morphology. In particular, we look at the suffixes derived from a corpus by unsupervised learning of morphology, and we ask which of these suffixes have a consistent syntactic function (e.g., in English, -ed is primarily a mark of verbal past tense, but -s marks both noun plurals and the 3rd person present on verbs).
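The abstract names the pipeline (adjacent-word statistics, a nearest-neighbor graph, an eigenvector decomposition, a two-dimensional rendering) without giving its details. The following is a minimal sketch of one way such a pipeline could look, not the authors' implementation. The function name `embed_words`, the use of cosine similarity over left and right neighbor counts, the neighbor count `k`, and the choice of the unnormalized graph Laplacian are all assumptions made for illustration.

```python
import numpy as np

def embed_words(tokens, k=2):
    """Return (vocab, coords): one 2-D coordinate per word type.

    Hypothetical helper; every design choice here is an assumption,
    since the abstract does not specify the construction.
    """
    vocab = sorted(set(tokens))
    index = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)
    if n < 3:
        raise ValueError("need at least 3 word types for a 2-D embedding")

    # Context vectors: counts of each word's left and right neighbors.
    ctx = np.zeros((n, 2 * n))
    for left, right in zip(tokens, tokens[1:]):
        ctx[index[right], index[left]] += 1       # `left` precedes `right`
        ctx[index[left], n + index[right]] += 1   # `right` follows `left`

    # Cosine similarity between context vectors (assumed similarity measure).
    norms = np.linalg.norm(ctx, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    unit = ctx / norms
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # a word is not its own neighbor

    # Symmetric k-nearest-neighbor adjacency matrix.
    k = min(k, n - 1)
    adj = np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(sim[i])[-k:]:
            adj[i, j] = adj[j, i] = 1.0

    # Unnormalized graph Laplacian L = D - A. eigh returns eigenvalues in
    # ascending order; column 0 is the trivial eigenvector for a connected
    # graph, so columns 1 and 2 serve as the two coordinates. If the graph
    # is disconnected, these columns may only separate its components.
    lap = np.diag(adj.sum(axis=1)) - adj
    _, eigvecs = np.linalg.eigh(lap)
    return vocab, eigvecs[:, 1:3]
```

Calling `embed_words(corpus_text.split())` on a tokenized corpus yields one point per word type, which can then be plotted to inspect whether words of the same category fall near each other. On a realistic corpus a sparse eigensolver and a frequency cutoff on the vocabulary would likely be needed; this dense version is only meant for small examples.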
Similar resources
Using eigenvectors of the bigram graph to infer grammatical features and categories
This paper describes the results of some experiments exploring statistical methods to infer syntactic categories from a raw corpus in an unsupervised fashion. It shares certain points in common with Brown et al. (1992) and work that has grown out of that: it employs statistical techniques to derive categories based on what words occur adjacent to a given word. However, we use an eigenvector deco...
Morpheme-Based Language Modeling for Amharic Speech Recognition
This paper presents the application of morpheme-based and factored language models in an Amharic speech recognition task. Since using morphemes in both acoustic and language models results, mostly, in performance degradation due to acoustic confusability and since it is problematic to use factored language models in standard word decoders, we applied the models in a lattice rescoring framework....
Morpho-syntactic Modeling of Korean with a Categorial Grammar
A morpho-syntactic categorial modeling for Korean is introduced. Variable categories and notations for word order treatment are newly invented and notations for elliptic arguments are also suggested. Incremental parsing technique of a morpheme graph is newly developed using the proposed categorial Korean modeling. Two heuristics based on the coverage of sub-trees and the part-of-speech bigrams ...
Finite-state Transducer Base with Explicit Modeling of Ph
This article describes the design and the experimental evaluation of the first Hungarian large vocabulary continuous speech recognition (LVCSR) system. The architecture of the recognition system is based on the recently proposed weighted finite state transducer (WFST) paradigm. The task domain is the recognition of fluently read sentences selected from a major daily newspaper. Recognition perfo...
Finite-state transducer based Hungarian LVCSR with explicit modeling of phonological changes
This article describes the design and the experimental evaluation of the first Hungarian large vocabulary continuous speech recognition (LVCSR) system. The architecture of the recognition system is based on the recently proposed weighted finite state transducer (WFST) paradigm. The task domain is the recognition of fluently read sentences selected from a major daily newspaper. Recognition perfo...
Journal title:
- CoRR
Volume: cs.CL/0207002
Pages: -
Publication date: 2002